11/10/2020

Agenda

  • Regression trees
  • Classification trees
  • Bagging and Random Forests
  • Boosting trees
  • Variable Importance measures

Recap


  • Deep trees: flexible fitters (non-linearities, interactions), can overfit
  • Shallow trees: not so flexible, can underfit
  • Ensemble methods: we can improve the bias-variance tradeoff, at the expense of some loss of interpretability

Bagging

Recall Bootstrap

Bootstrap aggregation, or bagging, is a general-purpose procedure for reducing the variance of a statistical learning method. We introduce it here because it is particularly useful and frequently used in the context of decision trees.

  • Recall that given a set of \(n\) independent observations \(X_1, \dots, X_n\), each with variance \(\sigma^2\), the variance of the mean \(\overline{X}\) of the observations is given by \(\frac{\sigma^2}{n}\). In other words, averaging a set of observations reduces variance

  • Now imagine that these “observations” are the predictions given by a statistical method. Of course, this is not practical because we generally do not have access to multiple training sets to average over (a short simulation after this list illustrates how averaging reduces variance)
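
A quick simulation, purely illustrative and not part of the lecture code, shows the variance-reduction effect of averaging: with \(n = 25\) draws of standard deviation \(\sigma = 2\), the variance of the mean is close to \(\sigma^2 / n = 0.16\).

set.seed(1)
sigma <- 2
n <- 25
xbar <- replicate(10000, mean(rnorm(n, mean = 0, sd = sigma)))
var(xbar)      # simulated variance of the mean, close to ...
sigma^2 / n    # ... the theoretical value 4 / 25 = 0.16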

Bagging

However, we can bootstrap by taking repeated samples from the (single) training data set.

To Bootstrap Aggregate (Bag), we proceed as follows (a code sketch follows the list):

  • Take \(B\) bootstrap samples from the training data, each of the same size as the training data. This will give us \(B\) training data sets
  • Fit a large tree to each bootstrap sample. This will give us \(B\) trees
  • Make a prediction for a new point \(x\), say \(\hat{f}^{(b)}(x)\), for each one of the \(B\) trees
  • Combine the results from each of the \(B\) trees to get an overall prediction, i.e. \[\hat{f}_{bag}(x) = \frac{1}{B} \sum_{b = 1}^B \hat{f}^{(b)}(x)\]
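
A minimal hand-rolled sketch of this procedure, using rpart trees on the Boston housing data (the packages, object names, and tuning values here are illustrative assumptions, not the lecture's code):

library(rpart)
library(MASS)    # Boston data

set.seed(1)
B <- 200
n <- nrow(Boston)
trees <- vector("list", B)
for (b in 1:B) {
  idx <- sample(n, n, replace = TRUE)                 # bootstrap sample of size n
  trees[[b]] <- rpart(medv ~ ., data = Boston[idx, ],
                      control = rpart.control(cp = 0, minsplit = 5))   # a large tree
}

# Bagged prediction: average the B tree predictions for the points in `newdata`
predict_bag <- function(newdata) {
  preds <- sapply(trees, predict, newdata = newdata)  # one column per tree
  rowMeans(preds)
}
head(predict_bag(Boston))

Setting cp = 0 keeps each tree large (low bias, high variance), which is exactly what the averaging step is meant to tame.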

How to average?

  • For numeric \(y\) we can combine the results easily by making our overall prediction the average of the predictions from each of the \(B\) trees
  • For categorical \(y\), it is not quite so obvious how you want to combine the results from the different trees (both options are sketched in code after this list):
    • Majority vote: get a prediction for \(x\) from each tree, and the class that gets the most “votes” (out of \(B\) “ballots”) is the predicted class
    • Average the probabilities \(\hat{p}^{(b)}\) from each tree (what most software does)
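
For concreteness, a toy sketch with made-up numbers for \(B = 5\) trees and a single new point \(x\):

votes <- c("yes", "no", "yes", "yes", "no")   # class predicted by each of the B trees
names(which.max(table(votes)))                # majority vote: "yes"

probs <- c(0.9, 0.4, 0.7, 0.6, 0.3)           # P(y = "yes" | x) from each tree
mean(probs)                                   # averaged probability: 0.58
ifelse(mean(probs) > 0.5, "yes", "no")        # classify with the averaged probability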

Why does it work?

We “randomize” our data and then build a lot of big, and hence noisy, trees:

  • The relationships which are real get captured in a lot of trees and hence do not wash out when we average
  • Stuff that happens “by chance” is idiosyncratic to one (or a few) trees and washes out in the average
  • You need \(B\) big enough to get the averaging to work, but it does not seem to hurt if you make \(B\) bigger than that
  • The cost of having very large \(B\) is in computational time

Out-of-Bag Error Estimation

  • It turns out that there is a very straightforward way to estimate the test (out-of-sample) error of a bagged model
  • One can show that, on average, each bagged tree makes use of about \(2/3\) of the observations
  • The remaining one-third of the observations not used to fit a given bagged tree are referred to as the out-of-bag (OOB) observations
  • We can predict the response for the \(i^{th}\) observation using each of the trees in which that observation was OOB. This will yield around \(B/3\) predictions for the \(i^{th}\) observation, which we average
  • This estimate is essentially the leave-one-out CV error for bagging, if \(B\) is large
  • By carefully keeping track of which bagged trees use which observations you can get out-of-sample predictions

Out-of-Bag Error Estimation

  1. For \(b = 1, \dots, B\):
    • Take a bootstrap sample \(\mathbf{x}^{(b)} = (x_1^{(b)}, \dots, x_n^{(b)})\) to use as a training set
    • Fit a regression tree \(\hat{f}^{(b)}\)
  2. For \(i = 1, \dots, n\):
    • Set \(S = \{b : (x_i, y_i) \notin \mathbf{x}^{(b)}\}\)
    • Calculate \[OOB_i = \left(y_i - \frac{1}{|S|} \sum_{b \in S} \hat{f}^{(b)}(x_i) \right)^2\]
  3. The final OOB MSE estimate is \(MSE_{test} = \frac{1}{n} \sum_{i=1}^n OOB_i\) (a hand-rolled version of this procedure is sketched below)
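
The following self-contained sketch mirrors these three steps, again with rpart trees on the Boston data (an illustrative setup, not the lecture's code); the key point is to remember each tree's bootstrap indices so we know for which trees observation \(i\) is OOB.

library(rpart)
library(MASS)

set.seed(1)
B <- 200
n <- nrow(Boston)
boot_idx <- vector("list", B)
trees    <- vector("list", B)
for (b in 1:B) {
  boot_idx[[b]] <- sample(n, n, replace = TRUE)       # remember who was sampled
  trees[[b]] <- rpart(medv ~ ., data = Boston[boot_idx[[b]], ],
                      control = rpart.control(cp = 0, minsplit = 5))
}

oob <- numeric(n)
for (i in 1:n) {
  S <- which(sapply(boot_idx, function(idx) !(i %in% idx)))    # trees for which i is OOB
  pred_i <- mean(sapply(S, function(b) predict(trees[[b]], newdata = Boston[i, ])))
  oob[i] <- (Boston$medv[i] - pred_i)^2
}
mean(oob)    # OOB estimate of the test MSE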

Boston data example

The plot suggests you just need a couple of hundred trees!


Random forests

  • Random forests provide an improvement over bagged trees by way of a small tweak that decorrelates the trees. This reduces the variance when we average the trees
  • As in bagging, we build a number of decision trees on bootstrapped training samples
  • But when building these decision trees, each time a split in a tree is considered, a random subset of \(m\) predictors is chosen as split candidates from the full set of \(p\) predictors. The split is allowed to use only one of those \(m\) predictors
  • Typical choice for \(m\) is \(m = \sqrt{p}\)

Note: Bagging is Random Forests with \(m = p\).
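
In R this is typically done with the randomForest package; the following is a sketch under the assumption that the package is installed, with illustrative parameter values:

library(randomForest)
library(MASS)

set.seed(1)
p <- ncol(Boston) - 1                     # 13 predictors
rf_fit  <- randomForest(medv ~ ., data = Boston, mtry = floor(sqrt(p)), ntree = 500)
bag_fit <- randomForest(medv ~ ., data = Boston, mtry = p, ntree = 500)   # m = p: bagging
rf_fit$mse[500]     # OOB MSE of the random forest after 500 trees
bag_fit$mse[500]    # OOB MSE of bagging, for comparison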


Boosting trees

Boosting trees

Like bagging, boosting is a general approach that can be applied to many statistical learning methods for regression or classification. Boosting uses many trees, too. However, the trees are grown sequentially: each tree is grown using information from previously grown trees.

  • Fit the data with a single tree
  • Crush the fit so that it does not work very well
  • Look at the part of \(y\) not captured by the crushed tree and fit a new tree to what is “left over”
  • Crush the new tree. Your new fit is the sum of the two trees
  • Repeat the above steps iteratively. At each iteration you fit “what is left over” with a tree, crush the tree, and then add the new crushed tree into the fit
  • Your final fit is the sum of many trees

Boosting algorithm

  1. Set \(\hat{f}(x) = 0\) and \(r_i = y_i\) for all \(i\) in the training set
  2. For \(b = 1, 2, \dots, B\) repeat
    • Fit a tree \(\hat{f}^{b}\) with \(d\) splits (\(d + 1\) terminal nodes) to the training data \((X, r)\)
    • Update \(\hat{f}\) by adding in a shrunken version of the new tree \(\hat{f}(x) \leftarrow \hat{f}(x) + \lambda \hat{f}^{b}(x)\)
    • Update the residuals \(r_i \leftarrow r_i - \lambda \hat{f}^{b}(x_i)\)
  3. Output the boosted model \[\hat{f}(x) = \sum_{b=1}^B \lambda \hat{f}^{b}(x)\] (a hand-rolled version of this algorithm is sketched below)
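
A hand-rolled sketch of this algorithm with stumps (\(d = 1\)) on the Boston data, again using rpart; the values of \(\lambda\) and \(B\) below are illustrative, not tuned:

library(rpart)
library(MASS)

set.seed(1)
lambda <- 0.01
B <- 1000
d <- 1
fhat <- rep(0, nrow(Boston))          # step 1: f-hat = 0
r    <- Boston$medv                   #         r_i = y_i
trees <- vector("list", B)
for (b in 1:B) {
  dat <- data.frame(Boston[, names(Boston) != "medv"], r = r)
  trees[[b]] <- rpart(r ~ ., data = dat,
                      control = rpart.control(maxdepth = d, cp = 0))   # small tree (a stump when d = 1)
  step_b <- predict(trees[[b]])       # fitted values of the new tree
  fhat <- fhat + lambda * step_b      # add a shrunken version of the new tree
  r    <- r - lambda * step_b         # update the residuals
}
mean((Boston$medv - fhat)^2)          # training MSE of the boosted fit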

Why does it work?

  • Unlike fitting a single large decision tree to the data, which amounts to fitting the data hard and potentially overfitting, the boosting approach instead learns slowly
  • Given the current model, we fit a decision tree to the residuals from the model. We then add this new decision tree into the fitted function in order to update the residuals
  • Each of these trees can be rather small, with just a few terminal nodes, determined by the parameter \(d\) in the algorithm
  • By fitting small trees to the residuals, we slowly improve \(\hat{f}\) in areas where it does not perform well. The shrinkage parameter \(\lambda\) slows the process down even further, allowing more and different shaped trees to attack the residuals

Choice of parameters

We have to choose:

  • \(\lambda\), shrinkage parameter: it makes each new tree a weak learner in the sense that it only does a little more fitting
  • \(B\), number of iterations (the number of trees in the sum). Unlike bagging and random forests, boosting can overfit if \(B\) is too large. We use cross-validation to select \(B\) (see the example after this list)
  • \(d\), the size of each new tree, which controls the complexity of the boosted ensemble. Often \(d = 1\) works well, in which case each tree is a stump, consisting of a single split and resulting in an additive model
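
In practice one usually relies on a package rather than the hand-rolled loop above. A sketch with the gbm package (assumed to be installed): shrinkage plays the role of \(\lambda\), n.trees of \(B\), interaction.depth of \(d\), and cv.folds lets cross-validation pick \(B\).

library(gbm)
library(MASS)

set.seed(1)
boost_fit <- gbm(medv ~ ., data = Boston, distribution = "gaussian",
                 n.trees = 5000, interaction.depth = 2, shrinkage = 0.01,
                 cv.folds = 5)
best_B <- gbm.perf(boost_fit, method = "cv")   # CV-selected number of trees
best_B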

Boosting for classification trees

Boosting for categorical \(y\) works in an analogous manner, but it is more complicated to define what is “left over” (a sketch using the gbm package follows the short list below).

  • We cannot simply use the residuals
  • We cannot just add up the fit
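
Packages handle these complications for us. A purely illustrative sketch with gbm, where the 0/1 outcome (whether medv exceeds its median) is made up for the example:

library(gbm)
library(MASS)

set.seed(1)
boston_cl <- Boston
boston_cl$high <- as.numeric(boston_cl$medv > median(boston_cl$medv))   # made-up 0/1 outcome
boston_cl$medv <- NULL

cl_fit <- gbm(high ~ ., data = boston_cl, distribution = "bernoulli",   # classification loss
              n.trees = 2000, interaction.depth = 1, shrinkage = 0.01)
head(predict(cl_fit, newdata = boston_cl, n.trees = 2000, type = "response"))   # predicted probabilities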


Boston data example

## [1] 571

Variable Importance measures

Variable importance measure

Ensemble methods can give dramatically better fits than simple trees.

  • Out-of-sample, they can work amazingly well
  • However, they are certainly not interpretable (you cannot look at hundreds or thousands of trees)

By computing summary measures, you can get some sense of how the trees work. In particular, we are often interested in which variables in \(x\) are really the “important” ones.

Variable importance measure

  • For a single tree, we can look at the splits (decision rules) and pick out the ones that use a particular variable. Then we can add up the reduction in RSS due to the splits using the variable. A large value indicates an important predictor
  • For bagging/random forests we can average the effect of a variable over the \(B\) trees
  • For Boosting we can sum the effects
  • Similarly, for classification trees, we consider the total amount by which the Gini index is decreased by splits over a given predictor (see the code sketch below)
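
In R these summaries are readily available. A hedged sketch, assuming the randomForest and gbm packages and fits along the lines of the earlier examples; the table on the next slide has the format of gbm's relative-influence output:

library(randomForest)
library(gbm)
library(MASS)

set.seed(1)
rf_fit <- randomForest(medv ~ ., data = Boston, importance = TRUE)
importance(rf_fit)      # %IncMSE and IncNodePurity for each predictor
varImpPlot(rf_fit)      # plot of the same measures

boost_fit <- gbm(medv ~ ., data = Boston, distribution = "gaussian",
                 n.trees = 5000, interaction.depth = 2, shrinkage = 0.01)
summary(boost_fit)      # relative influence (var, rel.inf) of each variable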

Variable importance measure

##             var    rel.inf
## lstat     lstat 32.9769020
## rm           rm 31.7265743
## dis         dis  9.6760567
## nox         nox  5.1820613
## crim       crim  4.9207831
## black     black  3.8421932
## ptratio ptratio  3.7491902
## age         age  3.1540710
## tax         tax  1.6917150
## chas       chas  1.0245005
## rad         rad  1.0180367
## indus     indus  0.9034953
## zn           zn  0.1344207

Recap

  • Decision trees are simple and interpretable models for regression and classification
  • However, they are often not competitive with other methods in terms of prediction accuracy
  • Bagging, random forests and boosting are good methods for improving the prediction accuracy of trees. They work by growing many trees on the training data and then combining the predictions of the resulting ensemble of trees
  • The latter two methods - random forests and boosting - are among the state-of-the-art methods for supervised learning
  • However, their results can be difficult to interpret

Question time